Detecting image purpose in World Wide Web documents

Authors

  • Seungyup Paek
  • John R. Smith
Abstract

The number of World-Wide Web (WWW) documents available to users of the Internet is growing at an incredible rate. Therefore, it is becoming increasingly important to develop systems that aid users in searching, filtering, and retrieving information from the Internet. Currently, only a few prototype systems catalog and index images in Web documents. To greatly improve the cataloging and indexing of images on the Web, we have developed a prototype rule-based system that detects the content images in Web documents. Content images are images that are associated with the main content of Web documents, as opposed to a multitude of other images that exist in Web documents for different purposes, such as decorative, advertisement and logo images. We present a system that uses decision tree learning for automated rule induction for the content image detection system. The system uses visual features, text-related features and the document context of images in concert for fast and effective content image detection in Web documents. We have evaluated the system by collecting more than 1200 images from 4 different Web sites and we have achieved an overall classification accuracy of 84%.
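
The idea of inducing content-image rules with a decision tree can be illustrated with a minimal sketch. This is not the authors' system: the three boolean features (`is_large` as a visual cue, `has_alt_text` as a text-related cue, `links_offsite` as a document-context cue), the toy training rows, and the simple ID3-style learner are all assumptions made for illustration only.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def best_feature(rows, labels, features):
    """Pick the boolean feature with the highest information gain."""
    base = entropy(labels)
    def gain(feature):
        g = base
        for value in (True, False):
            subset = [l for r, l in zip(rows, labels) if r[feature] == value]
            if subset:
                g -= len(subset) / len(labels) * entropy(subset)
        return g
    return max(features, key=gain)

def build_tree(rows, labels, features):
    """Return a leaf label, or a (feature, {value: subtree}) node."""
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not features:
        return majority
    feature = best_feature(rows, labels, features)
    remaining = [f for f in features if f != feature]
    branches = {}
    for value in (True, False):
        idx = [i for i, r in enumerate(rows) if r[feature] == value]
        if idx:
            branches[value] = build_tree([rows[i] for i in idx],
                                         [labels[i] for i in idx], remaining)
        else:
            branches[value] = majority  # no examples: fall back to majority
    return (feature, branches)

def classify(tree, row):
    """Walk the tree until a leaf label is reached."""
    while isinstance(tree, tuple):
        feature, branches = tree
        tree = branches[row[feature]]
    return tree

# Hypothetical features and toy data (not from the paper).
FEATURES = ["is_large", "has_alt_text", "links_offsite"]
TRAIN = [
    ((True,  True,  False), "content"),
    ((True,  False, False), "content"),
    ((True,  True,  True),  "other"),   # e.g. a banner advertisement
    ((True,  False, True),  "other"),
    ((False, True,  False), "other"),   # e.g. a small logo
    ((False, False, False), "other"),   # e.g. a decorative bullet
    ((False, False, True),  "other"),
    ((False, True,  True),  "other"),
]
rows = [dict(zip(FEATURES, values)) for values, _ in TRAIN]
labels = [label for _, label in TRAIN]
tree = build_tree(rows, labels, FEATURES)

photo = {"is_large": True, "has_alt_text": False, "links_offsite": False}
banner = {"is_large": True, "has_alt_text": True, "links_offsite": True}
print(classify(tree, photo), classify(tree, banner))
```

Each root-to-leaf path of the learned tree reads as one if-then rule, which is the sense in which decision tree learning can automate rule induction. A real system would use many more features, typically numeric ones, and would be evaluated on held-out images.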

Similar resources

A Unified Approach to Indexing Multimedia on the Web

Indexing multimedia Web documents can be regarded as an important part of Web engineering, a concept first proposed [19] by one of the authors and his collaborators in 1998 at the World Wide Web (WWW7) conference in Brisbane, Australia. Content-based indexing of multimedia has always been a challenging task. The enormity and diversity of the multimedia content on the World Wide Web (WWW) adds anot...

Image and Video Searching on the World Wide Web

The proliferation of multimedia on the World Wide Web has led to the introduction of Web search engines for images, video, and audio. On the Web, multimedia is typically embedded within documents that provide a wealth of indexing information. Harsh computational constraints imposed by the economics of advertising-supported searches restrict the complexity of analysis that can be performed at qu...

A Near-duplicate Detection Algorithm to Facilitate Document Clustering

Web mining faces huge problems due to duplicate and near-duplicate Web pages. Detecting near duplicates is very difficult in large collections of data such as the Internet. The presence of these Web pages plays an important role in performance degradation when integrating data from heterogeneous sources. These pages either increase the index storage space or increase the serving costs. Detecting t...

Unsupervised clustering for nontextual web document classification

While the breadth of vocabulary used in long documents may mislead traditional keyword-based retrieval systems, the demands for techniques in nontextual Web classification and retrieval from a large document collection are mounting. Only a few prototype systems have attempted to classify hypertext on the basis of nontextual elements in order to locate unfamiliar documents. As a result, a lar...

Application of Radon Transform in Detecting Turning Angle of Bodies and in Reading Multi-Lingual Documents

Recently, image processing techniques and robotic vision have been widely applied in fault detection of industrial products as well as in document reading. In order to compare the captured images of the target, it is necessary to prepare a perfect image and then apply matching. Preprocessing must therefore be done to correct the sample's and/or camera's movement which can occur during the...

Publication date: 1998